{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mastering Gradient Boosting with CatBoost"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial we will use dataset Amazon Employee Access Challenge from [Kaggle](https://www.kaggle.com) competition for our experiments. [Here](https://www.kaggle.com/c/amazon-employee-access-challenge/data) is the link to the challenge, that we will be exploring."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Libraries installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install --user --upgrade catboost\n",
    "#!pip install --user --upgrade ipywidgets\n",
    "#!pip install shap\n",
    "#!pip install sklearn\n",
    "#!jupyter nbextension enable --py widgetsnbextension"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "np.set_printoptions(precision=4)\n",
    "\n",
    "import catboost\n",
    "print(catboost.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost.datasets import amazon\n",
    "\n",
    "# If you have \"URLError: SSL: CERTIFICATE_VERIFY_FAILED\" uncomment next two lines:\n",
    "# import ssl\n",
    "# ssl._create_default_https_context = ssl._create_unverified_context\n",
    "\n",
    "# If you have any other error:\n",
    "# Download datasets from http://bit.ly/2ZUXTSv and uncomment next line:\n",
    "# train_df = pd.read_csv('train.csv', sep=',', header='infer')\n",
    "\n",
    "(train_df, test_df) = amazon()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Label values extraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = train_df.ACTION\n",
    "X = train_df.drop('ACTION', axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categorical features declaration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_features = list(range(0, X.shape[1]))\n",
    "print(cat_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking on label balance in dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Labels: {}'.format(set(y)))\n",
    "print('Zero count = {}, One count = {}'.format(len(y) - sum(y), sum(y)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training the first model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost import CatBoostClassifier\n",
    "model = CatBoostClassifier(iterations=100)\n",
    "model.fit(X, y, cat_features=cat_features, verbose=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.predict_proba(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several ways of passing dataset to training - using X,y (the initial matrix) or using Pool class.\n",
    "Pool class is the class for storing the dataset. In the next few blocks we'll explore the ways to create a Pool object.\n",
    "\n",
    "You can use Pool class if the dataset has more than just X and y (for example, it has sample weights or groups) or if the dataset is large and it takes long time to read it into python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost import Pool\n",
    "pool = Pool(data=X, label=y, cat_features=cat_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split your data into train and validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data = train_test_split(X, y, test_size=0.2, random_state=0)\n",
    "X_train, X_validation, y_train, y_validation = data\n",
    "\n",
    "train_pool = Pool(\n",
    "    data=X_train, \n",
    "    label=y_train, \n",
    "    cat_features=cat_features\n",
    ")\n",
    "\n",
    "validation_pool = Pool(\n",
    "    data=X_validation, \n",
    "    label=y_validation, \n",
    "    cat_features=cat_features\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting the objective function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Possible options for binary classification:\n",
    "\n",
    "`Logloss` for binary target.\n",
    "\n",
    "`CrossEntropy` for probabilities in target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = CatBoostClassifier(\n",
    "    iterations=5,\n",
    "    learning_rate=0.1,\n",
    "    # loss_function='CrossEntropy'\n",
    ")\n",
    "model.fit(train_pool, eval_set=validation_pool, verbose=False)\n",
    "\n",
    "print('Model is fitted: {}'.format(model.is_fitted()))\n",
    "print('Model params:\\n{}'.format(model.get_params()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stdout of the training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "model = CatBoostClassifier(\n",
    "    iterations=15,\n",
    "#     verbose=5,\n",
    ")\n",
    "model.fit(train_pool, eval_set=validation_pool);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metrics calculation and graph plotting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = CatBoostClassifier(\n",
    "    iterations=50,\n",
    "    learning_rate=0.5,\n",
    "    custom_loss=['AUC', 'Accuracy']\n",
    ")\n",
    "\n",
    "model.fit(\n",
    "    train_pool,\n",
    "    eval_set=validation_pool,\n",
    "    verbose=False,\n",
    "    plot=True\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model comparison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1 = CatBoostClassifier(\n",
    "    learning_rate=0.7,\n",
    "    iterations=100,\n",
    "    train_dir='learing_rate_0.7'\n",
    ")\n",
    "\n",
    "model2 = CatBoostClassifier(\n",
    "    learning_rate=0.01,\n",
    "    iterations=100,\n",
    "    train_dir='learing_rate_0.01'\n",
    ")\n",
    "\n",
    "model1.fit(train_pool, eval_set=validation_pool, verbose=20)\n",
    "model2.fit(train_pool, eval_set=validation_pool, verbose=20);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost import MetricVisualizer\n",
    "MetricVisualizer(['learing_rate_0.7', 'learing_rate_0.01']).start()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Best iteration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = CatBoostClassifier(\n",
    "    iterations=100,\n",
    "#     use_best_model=False\n",
    ")\n",
    "model.fit(\n",
    "    train_pool,\n",
    "    eval_set=validation_pool,\n",
    "    verbose=False,\n",
    "    plot=True\n",
    ");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Tree count: ' + str(model.tree_count_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cross-validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost import cv\n",
    "\n",
    "params = {\n",
    "    'loss_function': 'Logloss',\n",
    "    'iterations': 80,\n",
    "    'custom_loss': 'AUC',\n",
    "    'learning_rate': 0.5,\n",
    "}\n",
    "\n",
    "cv_data = cv(\n",
    "    params = params,\n",
    "    pool = train_pool,\n",
    "    fold_count=5,\n",
    "    shuffle=True,\n",
    "    partition_random_seed=0,\n",
    "    plot=True,\n",
    "    verbose=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_data.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_value = np.min(cv_data['test-Logloss-mean'])\n",
    "best_iter = np.argmin(cv_data['test-Logloss-mean'])\n",
    "\n",
    "print('Best validation Logloss score, not stratified: {:.4f}±{:.4f} on step {}'.format(\n",
    "    best_value,\n",
    "    cv_data['test-Logloss-std'][best_iter],\n",
    "    best_iter)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost import cv\n",
    "\n",
    "params = {\n",
    "    'loss_function': 'Logloss',\n",
    "    'iterations': 80,\n",
    "    'custom_loss': 'AUC',\n",
    "    'learning_rate': 0.5,\n",
    "}\n",
    "\n",
    "cv_data = cv(\n",
    "    params = params,\n",
    "    pool = train_pool,\n",
    "    fold_count=5,\n",
    "    shuffle=True,\n",
    "    partition_random_seed=0,\n",
    "    plot=True,\n",
    "    stratified=False,\n",
    "    verbose=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_value = cv_data['test-Logloss-mean'].min()\n",
    "best_iter = cv_data['test-Logloss-mean'].values.argmin()\n",
    "\n",
    "print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(\n",
    "    best_value,\n",
    "    cv_data['test-Logloss-std'][best_iter],\n",
    "    best_iter)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sklearn Grid Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "param_grid = {\n",
    "    \"learning_rate\": [0.001, 0.01, 0.5],\n",
    "}\n",
    "\n",
    "clf = CatBoostClassifier(\n",
    "    iterations=20, \n",
    "    cat_features=cat_features, \n",
    "    verbose=20\n",
    ")\n",
    "grid_search = GridSearchCV(clf, param_grid=param_grid, cv=3)\n",
    "results = grid_search.fit(X_train, y_train)\n",
    "results.best_estimator_.get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overfitting Detector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_with_early_stop = CatBoostClassifier(\n",
    "    iterations=200,\n",
    "    learning_rate=0.5,\n",
    "    early_stopping_rounds=20\n",
    ")\n",
    "\n",
    "model_with_early_stop.fit(\n",
    "    train_pool,\n",
    "    eval_set=validation_pool,\n",
    "    verbose=False,\n",
    "    plot=True\n",
    ");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model_with_early_stop.tree_count_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Overfitting Detector with eval metric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "model_with_early_stop = CatBoostClassifier(\n",
    "    eval_metric='AUC',\n",
    "    iterations=200,\n",
    "    learning_rate=0.5,\n",
    "    early_stopping_rounds=20\n",
    ")\n",
    "model_with_early_stop.fit(\n",
    "    train_pool,\n",
    "    eval_set=validation_pool,\n",
    "    verbose=False,\n",
    "    plot=True\n",
    ");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model_with_early_stop.tree_count_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = CatBoostClassifier(iterations=200, learning_rate=0.03)\n",
    "model.fit(train_pool, verbose=50);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model.predict(X_validation))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model.predict_proba(X_validation))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_pred = model.predict(\n",
    "    X_validation,\n",
    "    prediction_type='RawFormulaVal'\n",
    ")\n",
    "\n",
    "print(raw_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import exp\n",
    "\n",
    "sigmoid = lambda x: 1 / (1 + exp(-x))\n",
    "\n",
    "probabilities = sigmoid(raw_pred)\n",
    "\n",
    "print(probabilities)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Select decision boundary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://habrastorage.org/webt/y4/1q/yq/y41qyqfm9mcerp2ziys48phpjia.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from catboost.utils import get_roc_curve\n",
    "from catboost.utils import get_fpr_curve\n",
    "from catboost.utils import get_fnr_curve\n",
    "\n",
    "curve = get_roc_curve(model, validation_pool)\n",
    "(fpr, tpr, thresholds) = curve\n",
    "\n",
    "(thresholds, fpr) = get_fpr_curve(curve=curve)\n",
    "(thresholds, fnr) = get_fnr_curve(curve=curve)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(16, 8))\n",
    "style = {'alpha':0.5, 'lw':2}\n",
    "\n",
    "plt.plot(thresholds, fpr, color='blue', label='FPR', **style)\n",
    "plt.plot(thresholds, fnr, color='green', label='FNR', **style)\n",
    "\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xticks(fontsize=16)\n",
    "plt.yticks(fontsize=16)\n",
    "plt.grid(True)\n",
    "plt.xlabel('Threshold', fontsize=16)\n",
    "plt.ylabel('Error Rate', fontsize=16)\n",
    "plt.title('FPR-FNR curves', fontsize=20)\n",
    "plt.legend(loc=\"lower left\", fontsize=16);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from catboost.utils import select_threshold\n",
    "\n",
    "print(select_threshold(model, validation_pool, FNR=0.01))\n",
    "print(select_threshold(model, validation_pool, FPR=0.01))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Metric evaluation on a new dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics = model.eval_metrics(\n",
    "    data=validation_pool,\n",
    "    metrics=['Logloss','AUC'],\n",
    "    ntree_start=0,\n",
    "    ntree_end=0,\n",
    "    eval_period=1,\n",
    "    plot=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('AUC values:\\n{}'.format(np.array(metrics['AUC'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature importances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction values change"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Default feature importances for binary classification is PredictionValueChange - how much on average does the model change when the feature value changes.\n",
    "These feature importances are non negative.\n",
    "They are normalized and sum to 1, so you can look on these values like percentage of importance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.array(model.get_feature_importance(prettified=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loss function change"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The non default feature importance approximates how much the optimized loss function will change if the value of the feature changes.\n",
    "This importances might be negative if the feature has bad influence on the loss function.\n",
    "The importances are not normalized, the absolute value of the importance has the same scale as the optimized loss value.\n",
    "To calculate this importance value you need to pass train_pool as an argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.array(model.get_feature_importance(\n",
    "    data=train_pool, \n",
    "    type='LossFunctionChange', \n",
    "    prettified=True\n",
    "))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shap values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model.predict_proba([X.iloc[1,:]]))\n",
    "print(model.predict_proba([X.iloc[91,:]]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "shap_values = model.get_feature_importance(\n",
    "    data=validation_pool, \n",
    "    type='ShapValues'\n",
    ")\n",
    "expected_value = shap_values[0,-1]\n",
    "shap_values = shap_values[:,:-1]\n",
    "print(shap_values.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proba = model.predict_proba([X.iloc[1,:]])[0]\n",
    "raw = model.predict([X.iloc[1,:]], prediction_type='RawFormulaVal')[0]\n",
    "print('Probabilities', proba)\n",
    "print('Raw formula value %.4f' % raw)\n",
    "print('Probability from raw value %.4f' % sigmoid(raw))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shap\n",
    "\n",
    "shap.initjs()\n",
    "shap.force_plot(expected_value, shap_values[1,:], X_validation.iloc[1,:])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proba = model.predict_proba([X.iloc[91,:]])[0]\n",
    "raw = model.predict([X.iloc[91,:]], prediction_type='RawFormulaVal')[0]\n",
    "print('Probabilities', proba)\n",
    "print('Raw formula value %.4f' % raw)\n",
    "print('Probability from raw value %.4f' % sigmoid(raw))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shap\n",
    "shap.initjs()\n",
    "shap.force_plot(expected_value, shap_values[91,:], X_validation.iloc[91,:])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "shap.summary_plot(shap_values, X_validation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Snapshotting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "#!rm 'catboost_info/snapshot.bkp'\n",
    "\n",
    "model = CatBoostClassifier(\n",
    "    iterations=100,\n",
    "    save_snapshot=True,\n",
    "    snapshot_file='snapshot.bkp',\n",
    "    snapshot_interval=1\n",
    ")\n",
    "\n",
    "model.fit(train_pool, eval_set=validation_pool, verbose=10);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = CatBoostClassifier(iterations=10)\n",
    "model.fit(train_pool, eval_set=validation_pool, verbose=False)\n",
    "model.save_model('catboost_model.bin')\n",
    "model.save_model('catboost_model.json', format='json')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.load_model('catboost_model.bin')\n",
    "print(model.get_params())\n",
    "print(model.learning_rate_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperparameter tunning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tunned_model = CatBoostClassifier(\n",
    "    iterations=1000,\n",
    "    learning_rate=0.03,\n",
    "    depth=6,\n",
    "    l2_leaf_reg=3,\n",
    "    random_strength=1,\n",
    "    bagging_temperature=1\n",
    ")\n",
    "\n",
    "tunned_model.fit(\n",
    "    X_train, y_train,\n",
    "    cat_features=cat_features,\n",
    "    verbose=False,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    plot=True\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Speeding up the training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fast_model = CatBoostClassifier(\n",
    "    boosting_type='Plain',\n",
    "    rsm=0.5,\n",
    "    one_hot_max_size=50,\n",
    "    leaf_estimation_iterations=1,\n",
    "    max_ctr_complexity=1,\n",
    "    iterations=100,\n",
    "    learning_rate=0.3,\n",
    "    bootstrap_type='Bernoulli',\n",
    "    subsample=0.5\n",
    ")\n",
    "fast_model.fit(\n",
    "    X_train, y_train,\n",
    "    cat_features=cat_features,\n",
    "    verbose=False,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    plot=True\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reducing model size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "small_model = CatBoostClassifier(\n",
    "    learning_rate=0.03,\n",
    "    iterations=500,\n",
    "    model_size_reg=50,\n",
    "    max_ctr_complexity=1,\n",
    "    ctr_leaf_count_limit=100\n",
    ")\n",
    "small_model.fit(\n",
    "    X_train, y_train,\n",
    "    cat_features=cat_features,\n",
    "    verbose=False,\n",
    "    eval_set=(X_validation, y_validation),\n",
    "    plot=True\n",
    ");"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {
    "1057714ebc614324aa3ba2cf69408966": {
     "views": [
      {
       "cell_index": 17
      }
     ]
    },
    "8381e9eed05f4a03905ae8a56d7ab4ea": {
     "views": [
      {
       "cell_index": 48
      }
     ]
    },
    "f49684e8c5c44241bfe2c7f577f5cb41": {
     "views": [
      {
       "cell_index": 53
      }
     ]
    }
   },
   "version": "1.2.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
